Introduction to Machine Learning

These notes are based on the Spring 2019 COMPSCI 189/289A course at the University of California, Berkeley, taught by Jonathan Shewchuk.

#Supervised Learning

Classification

example 1: classify digits 1 and 7

  1. each digit is an $N\times N$ pixel matrix
  2. flatten each matrix into a vector
  3. create a classifier in the resulting $N^2$-dimensional space (a sketch follows this list)
    Note: the decision boundary is a hyperplane
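
A minimal sketch of these three steps, assuming NumPy; the image size, the random pixel data, and the random weights below are illustrative assumptions, not details from the lecture:

```python
import numpy as np

N = 28                                   # assumed image size: each digit is an N x N pixel matrix
rng = np.random.default_rng(0)
images = rng.random((100, N, N))         # 100 toy "images" standing in for real digit scans

# step 2: flatten each N x N matrix into a vector, giving points in an N^2-dimensional space
X = images.reshape(len(images), N * N)

# step 3: a linear classifier in that space predicts by the sign of w . x + alpha,
# so the decision boundary {x : w . x + alpha = 0} is a hyperplane
w = rng.normal(size=N * N)               # weights (normal vector of the hyperplane); untrained here
alpha = 0.0
predictions = np.sign(X @ w + alpha)     # e.g. +1 -> "7", -1 -> "1" (label assignment is illustrative)
```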

example 2: Bank loan default prediction

See the notes for M3S17 Quantitative Methods for Finance.

Features / Independent Variables / Predictor Variables

The measured attributes that describe each sample point; the three names are synonyms.

Overfitting

The model is shaped too specifically to one particular data set, so it is not predictive on new data.

Occurs when the test error gets worse because the classifier becomes too sensitive to outliers or to other spurious/untrue patterns.

Sinuous decision boundaries that fit the sample points so well that they do not classify future points well.

Quantify Overfitting
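
One common way to quantify it, consistent with the error types defined later in these notes, is to compare the training set error with the error on new data: a near-zero training error together with a much larger test error suggests overfitting. Below is a minimal sketch of that comparison, assuming NumPy; the synthetic data and the choice of a 1-nearest-neighbor classifier are illustrative, not from the lecture.

```python
import numpy as np

def one_nn_predict(X_train, y_train, X):
    # label each row of X with the label of its single nearest training point
    dists = np.linalg.norm(X[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(dists, axis=1)]

rng = np.random.default_rng(0)

def noisy_labels(X):
    # the class depends on the first feature, plus some label noise
    return np.where(X[:, 0] + rng.normal(scale=0.5, size=len(X)) > 0, 1, -1)

X_train = rng.normal(size=(50, 2)); y_train = noisy_labels(X_train)
X_test  = rng.normal(size=(50, 2)); y_test  = noisy_labels(X_test)

train_error = np.mean(one_nn_predict(X_train, y_train, X_train) != y_train)  # ~0: the data is memorized
test_error  = np.mean(one_nn_predict(X_train, y_train, X_test) != y_test)    # noticeably larger
print(train_error, test_error)  # a large gap between these two numbers indicates overfitting
```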

Decision Boundary

The boundary chosen by the classifier to separate items in the class from those that are not.

Decision Function

A function $f(x)$ that maps a sample point to a scalar value such that
$$
f(x) > 0 \ \text{ if } x \in \text{class } C, \qquad
f(x) \leq 0 \ \text{ if } x \notin \text{class } C.
$$

For such a decision function, the decision boundary is $\{x : f(x) = 0\}$, usually a $(d-1)$-dimensional surface in $R^d$.
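
As a concrete illustration (the specific function below is an example made up for these notes, not one from the lecture): in $R^2$, take the decision function
$$
f(x) = x_1 + x_2 - 1.
$$
Then $f(x) > 0$ on one side of the line $x_1 + x_2 = 1$ and $f(x) \leq 0$ on the other, so the decision boundary $\{x : f(x) = 0\}$ is that line, a $(d-1) = 1$-dimensional surface in $R^2$.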

Isosurface/iso contours

An isosurface of a function $f$ is $\{x : f(x) = c\}$ for a constant $c$ called the isovalue; here the isovalue is 0.
Note: the prefix 'iso-' means 'equal-'.

Linear classifier

The decision boundary is a line/plane (a hyperplane).
Usually given by a linear decision function.

$$
x = (x_1, x_2, \dots, x_d)^T
$$

Conventions:

  • Uppercase Roman letters: matrices, random variables, sets
  • Lowercase Roman letters: vectors
  • Greek letters: scalars
  • Other scalars:
    • n, number of sample points
    • d, number of features (the dimension)
    • i, j, k, integer indices
  • Functions: $f(\cdot)$, $s(\cdot)$, …

Norms

  • Euclidean norm: $|x| = \sqrt{\sum_i x_i^2}$
  • Normalize a vector: $\frac{x}{|x|}$
  • Dot product: $x \cdot y = \sum_i x_i y_i$
    • length: $|x| = \sqrt{x \cdot x}$
    • angle: $\cos\theta = \frac{x \cdot y}{|x|\,|y|}$
    (a quick numerical check of these formulas follows this list)
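
A quick NumPy check of the formulas above (the vectors are arbitrary example values):

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, 0.0])

length = np.sqrt(np.sum(x * x))      # Euclidean norm |x| = sqrt(x . x) = 5.0
unit_x = x / length                  # normalized vector x / |x|, which has length 1
dot = np.sum(x * y)                  # dot product x . y = sum_i x_i y_i = 3.0
cos_theta = dot / (length * np.linalg.norm(y))   # cos(theta) = 3 / (5 * 1) = 0.6
print(length, unit_x, dot, cos_theta)
```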

hyperplane

Given a decision function $f(x) = w \cdot x + \alpha$, the decision boundary is $H = \{x : w \cdot x = -\alpha\}$.
The set H is a hyperplane.

property
For any $x, y$ on H, $w \cdot (y - x) = 0$.

normal vector w

If $w$ is a unit vector, the signed distance from a point $x$ to H is $w \cdot x + \alpha$,
i.e. positive on one side of H, negative on the other side. (A numerical sketch follows below.)

Note: the (signed) distance from H to the origin is $\alpha$.
Note2: $\alpha = 0$ iff H passes through the origin.
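
A minimal numerical sketch of the signed distance, assuming NumPy; the values of $w$ and $\alpha$ are arbitrary examples:

```python
import numpy as np

w = np.array([3.0, 4.0])    # normal vector of H = {x : w . x = -alpha}
alpha = 10.0

scale = np.linalg.norm(w)   # rescale so that w becomes a unit vector; H itself is unchanged
w_unit, alpha_unit = w / scale, alpha / scale

def signed_distance(x):
    # positive on one side of H, negative on the other, zero on H itself
    return w_unit @ x + alpha_unit

print(signed_distance(np.array([0.0, 0.0])))    # 2.0: the distance from the origin to H is alpha / |w|
print(signed_distance(np.array([-2.0, -1.0])))  # 0.0: this point lies on H, since 3*(-2) + 4*(-1) = -alpha
```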

weights

The coefficients in $w$, and $\alpha$, are called weights or regression coefficients.

Linearly separable

The input data is linearly separable if there exists a hyperplane that separates all the sample points in C from those not in C.

Centroid classifier

Compute the mean $\mu_c$ of all vectors in class C and the mean $\mu_x$ of all vectors NOT in class C.

  • Decision function (a sketch follows this list):
    $f(x) = (\mu_c - \mu_x) \cdot x - (\mu_c - \mu_x) \cdot \frac{\mu_c + \mu_x}{2}$
    $(\mu_c - \mu_x)$ is the normal vector
    $\frac{\mu_c + \mu_x}{2}$ is the midpoint between $\mu_c$ and $\mu_x$
    so the decision boundary is the hyperplane that bisects $\overline{\mu_c \mu_x}$
  • good at: classifying samples drawn from two Gaussian/normal distributions, especially when the sample size is large
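
A minimal sketch of the centroid classifier on synthetic data, assuming NumPy; the two Gaussian clusters below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X_in  = rng.normal(loc=2.0, size=(100, 2))    # samples in class C
X_out = rng.normal(loc=-2.0, size=(100, 2))   # samples NOT in class C

mu_c = X_in.mean(axis=0)    # centroid of class C
mu_x = X_out.mean(axis=0)   # centroid of the rest

w = mu_c - mu_x                  # normal vector of the decision boundary
midpoint = (mu_c + mu_x) / 2     # the boundary passes through the midpoint of the two centroids

def f(x):
    # decision function f(x) = (mu_c - mu_x) . x - (mu_c - mu_x) . (mu_c + mu_x)/2
    return w @ x - w @ midpoint

print(f(np.array([2.0, 2.0])) > 0)     # True: classified as class C
print(f(np.array([-2.0, -2.0])) > 0)   # False: classified as not in C
```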

Perceptron Algorithm

Slow but correct for linearly separable points.
Uses a numerical optimisation algorithm, namely gradient descent.

Sample points $X_1, X_2, \dots, X_n$.
For each sample point, the label $y_i = 1$ if $X_i \in C$ and $y_i = -1$ if $X_i \notin C$.

For simplicity, assume the decision boundary passes through the origin.
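
A minimal sketch of the perceptron under that assumption (boundary through the origin), assuming NumPy; the learning rate, iteration cap, and synthetic data are illustrative choices. Each update can be read as a stochastic-gradient-style step that nudges $w$ toward correctly classifying a misclassified point:

```python
import numpy as np

def perceptron(X, y, lr=1.0, max_epochs=1000):
    """X: n x d sample points; y: labels in {+1, -1}. Returns a weight vector w
    with y_i * (w . X_i) > 0 for all i, if the data is linearly separable."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            if y[i] * (X[i] @ w) <= 0:   # X_i is misclassified (or lies on the boundary)
                w += lr * y[i] * X[i]    # step in the direction that pushes X_i to the correct side
                updated = True
        if not updated:
            break                        # no misclassified points left
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X @ np.array([1.0, -2.0]) > 0, 1, -1)   # separable labels, boundary through the origin
w = perceptron(X, y)
print(np.all(y * (X @ w) > 0))   # True once the algorithm has converged
```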

Regression

#Unsupervised Learning

Clustering

Dimensionality Reduction

Validation

Hold back a subset of the training data for later evaluation: the validation set, used to tune hyperparameters / choose the model.

test set: final evaluation

  1. train a classifier multiple times, with different models/hyperparameters
  2. evaluate each one on NEW (held-out validation) data
  3. choose the setting that gives the best validation result (a sketch of this workflow follows the list)
    Why: we want the model to work well on data in general. In many cases (kNN, for example), feeding already-used training data back in will always give the right output, which is not valuable for evaluating the model.
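
A minimal sketch of this workflow, assuming NumPy; the synthetic data, the split sizes, and the candidate values of k are illustrative choices (k in k-nearest-neighbors is the hyperparameter being tuned here):

```python
import numpy as np

def knn_predict(X_train, y_train, X, k):
    # classify each row of X by a majority vote among its k nearest training points
    dists = np.linalg.norm(X[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.sign(y_train[nearest].sum(axis=1))   # labels are in {+1, -1}

def error_rate(y_pred, y_true):
    return np.mean(y_pred != y_true)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=300) > 0, 1, -1)

# split: train on one part, tune the hyperparameter on the validation set,
# and touch the test set only once, for the final evaluation
X_tr, y_tr   = X[:150],    y[:150]
X_val, y_val = X[150:225], y[150:225]
X_te, y_te   = X[225:],    y[225:]

candidate_ks = [1, 3, 5, 15, 51]   # odd values avoid ties in the majority vote
val_errors = {k: error_rate(knn_predict(X_tr, y_tr, X_val, k), y_val) for k in candidate_ks}
best_k = min(val_errors, key=val_errors.get)
final_error = error_rate(knn_predict(X_tr, y_tr, X_te, best_k), y_te)
print(best_k, final_error)
```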

Types of error

  1. training set error: the fraction of training images not classified correctly
  2. test set error: the fraction of new data misclassified

outlier:

Sample points that are atypical.

hyperparameters

Settings that control overfitting/underfitting (e.g. k in kNN); tuned using the validation set.
